The HVT package is a collection of R functions for building topology-preserving maps for rich multivariate data analysis, with an emphasis on big data, that is, data with a large number of rows. The functions are organized around the following typical workflow:
Data Compression: Vector quantization (VQ) or hierarchical vector quantization (HVQ) using means or medians. This step compresses the rows (long data frame) using a compression objective.
Data Projection: Dimension projection of the compressed cells to 1D, 2D, or an interactive surface plot using Sammon’s nonlinear mapping algorithm. This step creates topology-preserving map (also called embedding) coordinates in the desired output dimension.
Tessellation: Create the cells required for object visualization using the Voronoi tessellation method. The package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology-preserving map, and is useful for semi-supervised tasks.
Scoring: Score new datasets and record their assignment using the map objects from the above steps, in a sequence of maps if required.
Dynamic Analysis: A collection of functions designed to understand and visually represent the movement of data over time within a dynamic system, with the ability to forecast the next cell (t+1) by examining its underlying flow pattern.
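The first two steps of this workflow can be sketched on synthetic data. The example below is a minimal illustration, not HVT's own implementation: it uses `stats::kmeans` as the quantizer and `MASS::sammon` for the projection, and the data, the number of centroids (25), and the variable names are assumptions made for this example.

```r
set.seed(1)

# Synthetic "long" data (assumption): many rows, three numeric columns.
raw <- matrix(rnorm(5000 * 3), ncol = 3,
              dimnames = list(NULL, c("a", "b", "c")))

# Compression: replace 5000 rows with 25 centroids (k-means as the quantizer).
km <- kmeans(raw, centers = 25, nstart = 3, iter.max = 50)
centroids <- km$centers

# Projection: Sammon's mapping of the centroids to 2D coordinates.
proj <- MASS::sammon(dist(centroids), k = 2, trace = FALSE)

dim(centroids)    # 25 rows, 3 columns
dim(proj$points)  # 25 rows, 2 columns
```

The 2D coordinates in `proj$points` are the kind of input a Voronoi tessellation step would then consume.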
The Lorenz attractor is a three-dimensional figure generated by a set of differential equations that model a simple chaotic dynamic system of convective flow. It arises from a simplified set of equations that describe the behavior of a system involving three variables. These variables represent the state of the system at any given time and are typically denoted by (x, y, z). The equations are as follows:
\[ \frac{dx}{dt} = \sigma (y - x) \]
\[ \frac{dy}{dt} = x (r - z) - y \]
\[ \frac{dz}{dt} = x y - \beta z \]

where dx/dt, dy/dt, and dz/dt represent the rates of change of x, y, and z respectively over time (t). σ, r, and β are constant parameters of the system, with σ (σ = 10) controlling the rate of convection, r (r = 28) controlling the difference in temperature between the convective and stable regions, and β (β = 8/3) representing the ratio of the width to the height of the convective layer.

When these equations are plotted in three-dimensional space, they produce a chaotic trajectory that never repeats. The Lorenz attractor exhibits sensitive dependence on initial conditions, meaning even small differences in the initial conditions can lead to drastically different trajectories over time. This sensitivity to initial conditions is a defining characteristic of chaotic systems.
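As an illustration, the equations above can be integrated numerically. The sketch below uses a fixed-step fourth-order Runge-Kutta scheme in base R. The choice of integrator, the function names `lorenz_rhs` and `simulate_lorenz`, and the number of steps are assumptions made for this example; this is not necessarily how the dataset used in this notebook was generated. The initial state (0, 1, 20) and the time step of 0.00025 are read off the first rows of the dataset as displayed in this notebook.

```r
# Right-hand side of the Lorenz system; state is c(x, y, z).
lorenz_rhs <- function(state, sigma = 10, r = 28, beta = 8 / 3) {
  x <- state[[1]]
  y <- state[[2]]
  z <- state[[3]]
  c(sigma * (y - x),
    x * (r - z) - y,
    x * y - beta * z)
}

# Fixed-step fourth-order Runge-Kutta integrator (illustrative choice).
simulate_lorenz <- function(init = c(0, 1, 20), dt = 0.00025, n_steps = 2000) {
  traj <- matrix(NA_real_, nrow = n_steps + 1, ncol = 3,
                 dimnames = list(NULL, c("X", "Y", "Z")))
  traj[1, ] <- init
  for (i in seq_len(n_steps)) {
    s  <- traj[i, ]
    k1 <- lorenz_rhs(s)
    k2 <- lorenz_rhs(s + dt / 2 * k1)
    k3 <- lorenz_rhs(s + dt / 2 * k2)
    k4 <- lorenz_rhs(s + dt * k3)
    traj[i + 1, ] <- s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
  }
  data.frame(t = (0:n_steps) * dt, traj)
}

lorenz_path <- simulate_lorenz()
head(lorenz_path, 3)
```

Perturbing `init` by a small amount and re-running the simulation over a longer horizon is a simple way to observe the sensitivity to initial conditions described above.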
In this notebook, we will use the Lorenz Attractor Dataset. This dataset contains 200 thousand observations and 5 columns. The dataset can be downloaded from here.
The dataset includes the following columns:
X, Y, Z: The coordinates of the system state.
U: The velocity.
t: The timestamp.
Here is a guide to installing the HVT package. It helps the user install the most recent version of the package.
###direct installation###
#install.packages("HVT")
#or
###git repo installation###
#library(devtools)
#devtools::install_github(repo = "Mu-Sigma/HVT")

NOTE: At the time of writing this vignette, the latest changes were not yet on CRAN, hence we source the scripts from the R folder directly into the session environment.
# Sourcing required code scripts for HVT
script_dir <- "../R"
r_files <- list.files(script_dir, pattern = "\\.R$", full.names = TRUE)
invisible(lapply(r_files, function(file) { source(file, echo = FALSE) }))

Here, we load the data. Let’s explore the Lorenz Attractor Dataset. For the sake of brevity, we display only the first ten rows.
library(dplyr)  # provides the %>% pipe used below
dataset <- read.csv("./sample_dataset/lorenze_attractor.csv")
dataset <- dataset %>% dplyr::select(X, Y, Z, U, t)
dataset$t <- round(dataset$t, 5)
Table(dataset, limit = 10)

| X | Y | Z | U | t |
|---|---|---|---|---|
| 0.0000000 | 1.0000000 | 20.00000 | 0.0000 | 0.00000 |
| 0.0024966 | 0.9997525 | 19.98669 | 0.0005 | 0.00025 |
| 0.0049863 | 0.9995101 | 19.97337 | 0.0010 | 0.00050 |
| 0.0074692 | 0.9992728 | 19.96006 | 0.0015 | 0.00075 |
| 0.0099454 | 0.9990405 | 19.94676 | 0.0020 | 0.00100 |
| 0.0124147 | 0.9988133 | 19.93347 | 0.0025 | 0.00125 |
| 0.0148774 | 0.9985912 | 19.92018 | 0.0030 | 0.00150 |
| 0.0173333 | 0.9983741 | 19.90691 | 0.0035 | 0.00175 |
| 0.0197826 | 0.9981621 | 19.89365 | 0.0040 | 0.00200 |
| 0.0222253 | 0.9979552 | 19.88040 | 0.0045 | 0.00225 |
Now let’s visualize the Lorenz attractor (overlapping spirals) in 3D space, using a random sample of 1,000 points.

# Sample 1000 rows to keep the interactive plot responsive
data_3d <- dataset[sample(seq_len(nrow(dataset)), 1000), ]
plot_3d <- plotly::plot_ly(data_3d, x = ~X, y = ~Y, z = ~Z) %>%
  plotly::add_markers(marker = list(
    size = 2,
    symbol = "circle",
    color = ~Z,
    colorscale = "Bluered",
    colorbar = list(title = "Z")))
plot_3d

Figure 1: Lorenz attractor in 3D space
Now let’s have a look at the structure of the Lorenz Attractor dataset.
str(dataset)
#> 'data.frame': 200000 obs. of 5 variables:
#> $ X: num 0 0.0025 0.00499 0.00747 0.00995 ...
#> $ Y: num 1 1 1 0.999 0.999 ...
#> $ Z: num 20 20 20 20 19.9 ...
#> $ U: num 0 0.0005 0.001 0.0015 0.002 ...
#> $ t: num 0 0.00025 0.0005 0.00075 0.001 0.00125 0.0015 0.00175 0.002 0.00225 ...

Data distribution
This section displays four objects.
Variable Histograms: The histogram distribution of all the variables in the dataset.
Box Plots: Box plots for each numeric column in the dataset across panels. These plots display the median and interquartile range of each column at a panel level.
Correlation Matrix: The Pearson correlation, a bivariate value measuring the linear correlation between two numeric columns. The output plot is shown as a matrix.
Summary EDA: A table of descriptive statistics for all the variables in the dataset.
The built-in function edaPlots displays the four objects listed above.
edaPlots(dataset, time_series = TRUE, time_column = 't')

| variable | min | 1st Quartile | median | mean | sd | 3rd Quartile | max | hist | n_row | n_missing |
|---|---|---|---|---|---|---|---|---|---|---|
| X | -18.0202 | -3.7356 | 0.8798 | 0.7083 | 7.8247 | 5.8663 | 16.7554 | ▂▃▇▅▃ | 2e+05 | 0 |
| Y | -24.2165 | -3.4265 | 0.7270 | 0.6957 | 9.0070 | 5.4724 | 21.8814 | ▁▂▇▃▂ | 2e+05 | 0 |
| Z | 5.6491 | 15.8927 | 21.6277 | 23.2424 | 8.8526 | 30.6142 | 44.7478 | ▃▇▅▅▂ | 2e+05 | 0 |
| U | -10.0000 | -3.9458 | 3.1532 | 1.8390 | 6.6585 | 8.1096 | 10.0000 | ▅▃▃▃▇ | 2e+05 | 0 |
| t | 0.0000 | 12.5000 | 25.0000 | 25.0000 | 14.4339 | 37.5000 | 50.0000 | ▇▇▇▇▇ | 2e+05 | 0 |
Train - Test Split
Let us split the dataset into train and test sets. Because the data is a time series, we keep the time order: the first 80% of the rows form the training set and the remaining 20% form the test set.
noOfPoints <- dim(dataset)[1]
trainLength <- as.integer(noOfPoints * 0.8)
trainDataset <- dataset[1:trainLength,]
testDataset <- dataset[(trainLength+1):noOfPoints,]
rownames(testDataset) <- NULL

Let’s have a look at the training dataset containing 160,000 data points. For the sake of brevity, we display only the first 10 rows.
Table(trainDataset, limit = 10)

| X | Y | Z | U | t |
|---|---|---|---|---|
| 0.0000000 | 1.0000000 | 20.00000 | 0.0000 | 0.00000 |
| 0.0024966 | 0.9997525 | 19.98669 | 0.0005 | 0.00025 |
| 0.0049863 | 0.9995101 | 19.97337 | 0.0010 | 0.00050 |
| 0.0074692 | 0.9992728 | 19.96006 | 0.0015 | 0.00075 |
| 0.0099454 | 0.9990405 | 19.94676 | 0.0020 | 0.00100 |
| 0.0124147 | 0.9988133 | 19.93347 | 0.0025 | 0.00125 |
| 0.0148774 | 0.9985912 | 19.92018 | 0.0030 | 0.00150 |
| 0.0173333 | 0.9983741 | 19.90691 | 0.0035 | 0.00175 |
| 0.0197826 | 0.9981621 | 19.89365 | 0.0040 | 0.00200 |
| 0.0222253 | 0.9979552 | 19.88040 | 0.0045 | 0.00225 |
Now let’s have a look at the structure of the training dataset.
str(trainDataset)
#> 'data.frame': 160000 obs. of 5 variables:
#> $ X: num 0 0.0025 0.00499 0.00747 0.00995 ...
#> $ Y: num 1 1 1 0.999 0.999 ...
#> $ Z: num 20 20 20 20 19.9 ...
#> $ U: num 0 0.0005 0.001 0.0015 0.002 ...
#> $ t: num 0 0.00025 0.0005 0.00075 0.001 0.00125 0.0015 0.00175 0.002 0.00225 ...

Data Distribution

edaPlots(trainDataset, time_series = TRUE, time_column = 't')

| variable | min | 1st Quartile | median | mean | sd | 3rd Quartile | max | hist | n_row | n_missing |
|---|---|---|---|---|---|---|---|---|---|---|
| X | -18.0202 | -3.6928 | 1.0917 | 0.8511 | 7.8501 | 6.1564 | 16.7554 | ▂▃▇▆▃ | 160000 | 0 |
| Y | -24.2165 | -3.4047 | 0.9938 | 0.8913 | 9.0368 | 5.9268 | 21.8814 | ▁▂▇▃▂ | 160000 | 0 |
| Z | 5.6491 | 16.1278 | 21.8036 | 23.3181 | 8.7778 | 30.6148 | 44.7478 | ▃▇▆▅▂ | 160000 | 0 |
| U | -10.0000 | -5.4029 | 2.8225 | 1.4319 | 6.9893 | 8.1504 | 10.0000 | ▅▂▃▃▇ | 160000 | 0 |
| t | 0.0000 | 10.0000 | 20.0000 | 20.0000 | 11.5471 | 30.0000 | 40.0000 | ▇▇▇▇▇ | 160000 | 0 |
Let’s have a look at the testing dataset containing 40,000 data points. For the sake of brevity, we display only the first 10 rows.

Table(testDataset, limit = 10)

| X | Y | Z | U | t |
|---|---|---|---|---|
| 16.05834 | 13.65882 | 39.59945 | 9.893524 | 40.00020 |
| 16.05229 | 13.60880 | 39.62776 | 9.893451 | 40.00045 |
| 16.04613 | 13.55869 | 39.65584 | 9.893379 | 40.00070 |
| 16.03985 | 13.50850 | 39.68367 | 9.893306 | 40.00095 |
| 16.03347 | 13.45823 | 39.71126 | 9.893233 | 40.00120 |
| 16.02698 | 13.40789 | 39.73861 | 9.893160 | 40.00145 |
| 16.02037 | 13.35746 | 39.76572 | 9.893087 | 40.00170 |
| 16.01366 | 13.30696 | 39.79259 | 9.893014 | 40.00195 |
| 16.00684 | 13.25639 | 39.81921 | 9.892941 | 40.00220 |
| 15.99991 | 13.20574 | 39.84559 | 9.892868 | 40.00245 |
Now let’s have a look at the structure of the testing dataset.
str(testDataset)
#> 'data.frame': 40000 obs. of 5 variables:
#> $ X: num 16.1 16.1 16 16 16 ...
#> $ Y: num 13.7 13.6 13.6 13.5 13.5 ...
#> $ Z: num 39.6 39.6 39.7 39.7 39.7 ...
#> $ U: num 9.89 9.89 9.89 9.89 9.89 ...
#> $ t: num 40 40 40 40 40 ...

Data Distribution

edaPlots(testDataset, time_series = TRUE, time_column = 't')

| variable | min | 1st Quartile | median | mean | sd | 3rd Quartile | max | hist | n_row | n_missing |
|---|---|---|---|---|---|---|---|---|---|---|
| X | -16.2606 | -3.9065 | -0.0464 | 0.1371 | 7.6957 | 4.4283 | 16.0583 | ▂▃▇▃▂ | 40000 | 0 |
| Y | -20.9897 | -3.5599 | -0.5983 | -0.0863 | 8.8440 | 3.5431 | 19.5597 | ▂▂▇▂▂ | 40000 | 0 |
| Z | 7.9115 | 15.0266 | 20.8133 | 22.9399 | 9.1395 | 30.6121 | 41.3323 | ▆▇▅▃▅ | 40000 | 0 |
| U | -5.4402 | -0.7516 | 4.1210 | 3.4677 | 4.7921 | 7.9847 | 9.8935 | ▃▃▃▅▇ | 40000 | 0 |
| t | 40.0002 | 42.5001 | 45.0001 | 45.0001 | 2.8868 | 47.5001 | 50.0000 | ▇▇▇▇▇ | 40000 | 0 |
We will use the trainHVT function to compress our dataset while preserving its essential features.
Model Parameters
NOTE: Compression is applied only to the X, Y, and Z coordinates, not to U (velocity) or t (timestamp). After training and scoring, we merge the U and t columns back into the dataset.
set.seed(240)
hvt.results <- trainHVT(
trainDataset[,-c(4:5)],
n_cells = 100,
depth = 1,
quant.err = 0.1,
normalize = TRUE,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans"
)

Let’s check out the compression summary.
displayTable(data = hvt.results[[3]]$compression_summary, columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")

| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 100 | 0 | 0 | n_cells: 100 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
NOTE: In the table above, ‘percentOfCellsBelowQuantizationErrorThreshold’ is zero, which means that none of the 100 cells has a quantization error below the threshold of 0.1, so the desired level of compression has not been reached. Typically, we would keep increasing n_cells until at least 80% of the cells fall below the threshold. In this vignette we do not do so, because the plots generated by the dynamic analysis functions would become cluttered and complex, making the explanations less clear.
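To make the compression summary more concrete, the sketch below computes a per-cell quantization error and the share of cells below a threshold on synthetic data. It is an approximation under stated assumptions and not the trainHVT implementation: the data are random, `stats::kmeans` stands in for the quantizer, and the error is taken as the maximum L1 distance between a point and its centroid, averaged over dimensions. The exact definition used inside HVT may differ.

```r
set.seed(240)

# Synthetic, normalized 3-column data standing in for X, Y, Z (assumption).
pts <- scale(matrix(rnorm(3000), ncol = 3))

quant_err_threshold <- 0.1
n_cells <- 20

km <- kmeans(pts, centers = n_cells, nstart = 5, iter.max = 50)

# L1 distance from each point to its own centroid, averaged over dimensions
# (the averaging is an assumption of this sketch).
l1_dist <- rowMeans(abs(pts - km$centers[km$cluster, ]))

# "max" error metric: the worst-case distance within each cell.
cell_error <- tapply(l1_dist, km$cluster, max)

# Share of cells whose error is at or below the threshold.
pct_below <- mean(cell_error <= quant_err_threshold)
pct_below
```

Re-running the sketch with a larger `n_cells` shrinks the cells and generally raises `pct_below`, which mirrors the reasoning behind increasing n_cells until the 80% target is met.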
Now, let’s plot the Voronoi tessellation for 100 cells.
Figure 2: The Voronoi tessellation for layer 1 shown for the 100 cells in the dataset ’lorenz attractor’
Now that we have built the model, let us use it for scoring.
NOTE: Here we score the entire dataset rather than only the test dataset.
set.seed(240)
scoring_var <- scoreHVT(
dataset,
hvt.results,
child.level = 1)

The flow map functions described in the next section require the Cell ID from the scoring output and the sorted timestamps from the dataset used for scoring. We therefore merge the two to get a modified data frame that pairs Cell IDs with their respective timestamps.
Let’s see which cell and level each point belongs to, along with the sorted timestamp. For the sake of brevity, we show only the first 100 rows.
scored_data <- scoring_var[["scoredPredictedData"]] %>%
  round(2) %>%
  cbind(dataset) %>%
  as.data.frame()
colnames(scored_data) <- c("Segment.Level", "Segment.Parent", "Segment.Child", "n", "Cell.ID",
                           "Quant.Error", "pred_X", "pred_Y", "pred_Z", "centroidRadius",
                           "diff", "anomalyFlag", "X", "Y", "Z", "U", "t")
displayTable(data = scored_data, columnName = 'Quant.Error', value = 0.1, tableType = "summary", limit = 100)

| Segment.Level | Segment.Parent | Segment.Child | n | Cell.ID | Quant.Error | pred_X | pred_Y | pred_Z | centroidRadius | diff | anomalyFlag | X | Y | Z | U | t |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 43 | 1 | 54 | 0.08 | -0.11 | 0.01 | -0.38 | 0.12 | 0.05 | 0 | 0.00 | 1.00 | 20.00 | 0.00 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.08 | -0.11 | 0.01 | -0.38 | 0.12 | 0.05 | 0 | 0.00 | 1.00 | 19.99 | 0.00 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.11 | 0.01 | -0.38 | 0.12 | 0.05 | 0 | 0.00 | 1.00 | 19.97 | 0.00 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.11 | 0.01 | -0.38 | 0.12 | 0.05 | 0 | 0.01 | 1.00 | 19.96 | 0.00 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.11 | 0.01 | -0.38 | 0.12 | 0.05 | 0 | 0.01 | 1.00 | 19.95 | 0.00 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.11 | 0.01 | -0.39 | 0.12 | 0.05 | 0 | 0.01 | 1.00 | 19.93 | 0.00 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.11 | 0.01 | -0.39 | 0.12 | 0.05 | 0 | 0.01 | 1.00 | 19.92 | 0.00 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.11 | 0.01 | -0.39 | 0.12 | 0.05 | 0 | 0.02 | 1.00 | 19.91 | 0.00 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.11 | 0.01 | -0.39 | 0.12 | 0.05 | 0 | 0.02 | 1.00 | 19.89 | 0.00 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.11 | 0.01 | -0.39 | 0.12 | 0.05 | 0 | 0.02 | 1.00 | 19.88 | 0.00 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.11 | 0.01 | -0.39 | 0.12 | 0.05 | 0 | 0.02 | 1.00 | 19.87 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.10 | 0.01 | -0.39 | 0.12 | 0.05 | 0 | 0.03 | 1.00 | 19.85 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.10 | 0.01 | -0.40 | 0.12 | 0.05 | 0 | 0.03 | 1.00 | 19.84 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.10 | 0.01 | -0.40 | 0.12 | 0.05 | 0 | 0.03 | 1.00 | 19.83 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.10 | 0.01 | -0.40 | 0.12 | 0.05 | 0 | 0.03 | 1.00 | 19.81 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.10 | 0.01 | -0.40 | 0.12 | 0.05 | 0 | 0.04 | 1.00 | 19.80 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.10 | 0.01 | -0.40 | 0.12 | 0.06 | 0 | 0.04 | 1.00 | 19.79 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.10 | 0.01 | -0.40 | 0.12 | 0.06 | 0 | 0.04 | 1.00 | 19.77 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.07 | -0.10 | 0.01 | -0.41 | 0.12 | 0.06 | 0 | 0.04 | 1.00 | 19.76 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.41 | 0.12 | 0.06 | 0 | 0.05 | 1.00 | 19.75 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.41 | 0.12 | 0.06 | 0 | 0.05 | 1.00 | 19.74 | 0.01 | 0.00 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.41 | 0.12 | 0.06 | 0 | 0.05 | 1.00 | 19.72 | 0.01 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.41 | 0.12 | 0.06 | 0 | 0.05 | 1.00 | 19.71 | 0.01 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.41 | 0.12 | 0.06 | 0 | 0.06 | 1.00 | 19.70 | 0.01 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.41 | 0.12 | 0.06 | 0 | 0.06 | 1.00 | 19.68 | 0.01 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.42 | 0.12 | 0.06 | 0 | 0.06 | 1.00 | 19.67 | 0.01 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.42 | 0.12 | 0.06 | 0 | 0.06 | 1.00 | 19.66 | 0.01 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.42 | 0.12 | 0.06 | 0 | 0.07 | 1.00 | 19.64 | 0.01 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.42 | 0.12 | 0.06 | 0 | 0.07 | 0.99 | 19.63 | 0.01 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.42 | 0.12 | 0.06 | 0 | 0.07 | 0.99 | 19.62 | 0.01 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.42 | 0.12 | 0.06 | 0 | 0.07 | 0.99 | 19.60 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.42 | 0.12 | 0.06 | 0 | 0.07 | 0.99 | 19.59 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.43 | 0.12 | 0.07 | 0 | 0.08 | 0.99 | 19.58 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.43 | 0.12 | 0.07 | 0 | 0.08 | 0.99 | 19.57 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.43 | 0.12 | 0.07 | 0 | 0.08 | 0.99 | 19.55 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.06 | -0.10 | 0.01 | -0.43 | 0.12 | 0.07 | 0 | 0.08 | 0.99 | 19.54 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.10 | 0.01 | -0.43 | 0.12 | 0.07 | 0 | 0.09 | 0.99 | 19.53 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.10 | 0.01 | -0.43 | 0.12 | 0.07 | 0 | 0.09 | 0.99 | 19.51 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.10 | 0.01 | -0.44 | 0.12 | 0.07 | 0 | 0.09 | 0.99 | 19.50 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.10 | 0.01 | -0.44 | 0.12 | 0.07 | 0 | 0.09 | 0.99 | 19.49 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.10 | 0.01 | -0.44 | 0.12 | 0.07 | 0 | 0.09 | 0.99 | 19.47 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.10 | 0.01 | -0.44 | 0.12 | 0.07 | 0 | 0.10 | 0.99 | 19.46 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.10 | 0.01 | -0.44 | 0.12 | 0.07 | 0 | 0.10 | 0.99 | 19.45 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.10 | 0.01 | -0.44 | 0.12 | 0.07 | 0 | 0.10 | 0.99 | 19.44 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.10 | 0.01 | -0.44 | 0.12 | 0.07 | 0 | 0.10 | 0.99 | 19.42 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.09 | 0.01 | -0.45 | 0.12 | 0.07 | 0 | 0.11 | 0.99 | 19.41 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.09 | 0.01 | -0.45 | 0.12 | 0.07 | 0 | 0.11 | 0.99 | 19.40 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.09 | 0.01 | -0.45 | 0.12 | 0.07 | 0 | 0.11 | 0.99 | 19.38 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.09 | 0.01 | -0.45 | 0.12 | 0.07 | 0 | 0.11 | 0.99 | 19.37 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.09 | 0.01 | -0.45 | 0.12 | 0.08 | 0 | 0.11 | 0.99 | 19.36 | 0.02 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.09 | 0.01 | -0.45 | 0.12 | 0.08 | 0 | 0.12 | 0.99 | 19.35 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.09 | 0.01 | -0.45 | 0.12 | 0.08 | 0 | 0.12 | 0.99 | 19.33 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.05 | -0.09 | 0.01 | -0.46 | 0.12 | 0.08 | 0 | 0.12 | 0.99 | 19.32 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.46 | 0.12 | 0.08 | 0 | 0.12 | 0.99 | 19.31 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.46 | 0.12 | 0.08 | 0 | 0.13 | 0.99 | 19.29 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.46 | 0.12 | 0.08 | 0 | 0.13 | 0.99 | 19.28 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.46 | 0.12 | 0.08 | 0 | 0.13 | 0.99 | 19.27 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.46 | 0.12 | 0.08 | 0 | 0.13 | 0.99 | 19.26 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.46 | 0.12 | 0.08 | 0 | 0.13 | 0.99 | 19.24 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.47 | 0.12 | 0.08 | 0 | 0.14 | 0.99 | 19.23 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.47 | 0.12 | 0.08 | 0 | 0.14 | 0.99 | 19.22 | 0.03 | 0.01 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.47 | 0.12 | 0.08 | 0 | 0.14 | 0.99 | 19.20 | 0.03 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.47 | 0.12 | 0.08 | 0 | 0.14 | 0.99 | 19.19 | 0.03 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.47 | 0.12 | 0.08 | 0 | 0.15 | 0.99 | 19.18 | 0.03 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.47 | 0.12 | 0.08 | 0 | 0.15 | 0.99 | 19.17 | 0.03 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.47 | 0.12 | 0.08 | 0 | 0.15 | 0.99 | 19.15 | 0.03 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.48 | 0.12 | 0.08 | 0 | 0.15 | 0.99 | 19.14 | 0.03 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.48 | 0.12 | 0.09 | 0 | 0.15 | 0.99 | 19.13 | 0.03 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.48 | 0.12 | 0.09 | 0 | 0.16 | 0.99 | 19.11 | 0.03 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.04 | -0.09 | 0.01 | -0.48 | 0.12 | 0.09 | 0 | 0.16 | 0.99 | 19.10 | 0.03 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.48 | 0.12 | 0.09 | 0 | 0.16 | 0.99 | 19.09 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.48 | 0.12 | 0.09 | 0 | 0.16 | 1.00 | 19.08 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.48 | 0.12 | 0.09 | 0 | 0.16 | 1.00 | 19.06 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.49 | 0.12 | 0.09 | 0 | 0.17 | 1.00 | 19.05 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.49 | 0.12 | 0.09 | 0 | 0.17 | 1.00 | 19.04 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.49 | 0.12 | 0.09 | 0 | 0.17 | 1.00 | 19.03 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.49 | 0.12 | 0.09 | 0 | 0.17 | 1.00 | 19.01 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.49 | 0.12 | 0.09 | 0 | 0.17 | 1.00 | 19.00 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.49 | 0.12 | 0.09 | 0 | 0.18 | 1.00 | 18.99 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.49 | 0.12 | 0.09 | 0 | 0.18 | 1.00 | 18.98 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.50 | 0.12 | 0.09 | 0 | 0.18 | 1.00 | 18.96 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.09 | 0.01 | -0.50 | 0.12 | 0.09 | 0 | 0.18 | 1.00 | 18.95 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.50 | 0.12 | 0.09 | 0 | 0.18 | 1.00 | 18.94 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.50 | 0.12 | 0.09 | 0 | 0.19 | 1.00 | 18.93 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.50 | 0.12 | 0.10 | 0 | 0.19 | 1.00 | 18.91 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.50 | 0.12 | 0.10 | 0 | 0.19 | 1.00 | 18.90 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.50 | 0.12 | 0.10 | 0 | 0.19 | 1.00 | 18.89 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.51 | 0.12 | 0.10 | 0 | 0.19 | 1.00 | 18.88 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.51 | 0.12 | 0.10 | 0 | 0.20 | 1.00 | 18.86 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.51 | 0.12 | 0.10 | 0 | 0.20 | 1.00 | 18.85 | 0.04 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.51 | 0.12 | 0.09 | 0 | 0.20 | 1.00 | 18.84 | 0.05 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.51 | 0.12 | 0.09 | 0 | 0.20 | 1.00 | 18.83 | 0.05 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.51 | 0.12 | 0.09 | 0 | 0.20 | 1.00 | 18.81 | 0.05 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.51 | 0.12 | 0.09 | 0 | 0.21 | 1.00 | 18.80 | 0.05 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.52 | 0.12 | 0.09 | 0 | 0.21 | 1.00 | 18.79 | 0.05 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.52 | 0.12 | 0.09 | 0 | 0.21 | 1.00 | 18.78 | 0.05 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.52 | 0.12 | 0.09 | 0 | 0.21 | 1.00 | 18.76 | 0.05 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.52 | 0.12 | 0.09 | 0 | 0.21 | 1.00 | 18.75 | 0.05 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.52 | 0.12 | 0.09 | 0 | 0.22 | 1.00 | 18.74 | 0.05 | 0.02 |
| 1 | 1 | 43 | 1 | 54 | 0.03 | -0.08 | 0.01 | -0.52 | 0.12 | 0.09 | 0 | 0.22 | 1.00 | 18.73 | 0.05 | 0.02 |
Let’s look at the function plotStateTransition, which creates a time-series plotly object.
plotStateTransition(
df,
sample_size,
line_plot,
cellid_column,
time_column
)

df - A dataframe containing Cell IDs and timestamps.
sample_size - A numeric value between 0.1 and 1 specifying the fraction of the data to sample. The highest value, 1, produces a plot of the entire dataset. Sampling proceeds from the last observation to the first.
line_plot - A logical value. If TRUE, the output is a time-series plot with a line connecting the states, based on sample_size. If FALSE, the output is a time-series plot without the connecting line, based on sample_size.
cellid_column - A character string specifying the name of the Cell ID column in the dataframe passed to this function.
time_column - A character string specifying the name of the timestamp column in the dataframe passed to this function.
plotStateTransition(df = scored_data, cellid_column = "Cell.ID", time_column = "t", sample_size = 1)

getTransitionProbability(
df,
cellid_column,
time_column)

df - A dataframe containing Cell IDs and timestamps.
cellid_column - A character string specifying the name of the Cell ID column in the dataframe passed to this function.
time_column - A character string specifying the name of the timestamp column in the dataframe passed to this function.
This function returns the Tplus1 (next-state) probabilities for all cells in the form of a table. For the sake of brevity, we display only the first few rows of the probability table.

trans_table <- getTransitionProbability(df = scored_data, cellid_column = "Cell.ID", time_column = "t")

NOTE: The output is stored as a nested list, which makes it easy to access by Cell ID. For this demo, we combine it into a single dataframe and display the first 10 rows.
names_list <- lapply(trans_table, names)
common_names <- Reduce(intersect, names_list)
trans_table_list <- lapply(trans_table, function(df) df[common_names])
combined_df <- do.call(rbind, trans_table_list)
combined_df$Next_State <- as.numeric(combined_df$Next_State)
Table(combined_df, limit = 10)

| Current_State | Next_State | Relative_Frequency | Probability_Percentage |
|---|---|---|---|
| 1 | 1 | 1296 | 0.9878 |
| 1 | 4 | 14 | 0.0107 |
| 1 | 6 | 2 | 0.0015 |
| 2 | 1 | 16 | 0.0082 |
| 2 | 2 | 1917 | 0.9866 |
| 2 | 6 | 9 | 0.0046 |
| 2 | 10 | 1 | 0.0005 |
| 3 | 2 | 24 | 0.0125 |
| 3 | 3 | 1901 | 0.9865 |
| 3 | 10 | 2 | 0.0010 |
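The idea behind such a table can be sketched in base R: pair each state with the state that follows it, count the pairs, and normalize each row. The sequence `cell_seq` below is hypothetical and the column names are chosen for this example only; it illustrates the calculation and is not the getTransitionProbability implementation.

```r
# Hypothetical, time-ordered sequence of cell IDs (not taken from the dataset).
cell_seq <- c(1, 1, 1, 4, 4, 1, 1, 6, 6, 6, 4, 4, 4, 1)

# Pair each state with the state that follows it.
current_state <- head(cell_seq, -1)
next_state    <- tail(cell_seq, -1)

# Count transitions, then normalize each row so that it sums to 1.
counts <- table(Current_State = current_state, Next_State = next_state)
probs  <- prop.table(counts, margin = 1)

# Long format, keeping only the transitions that actually occur.
trans_df <- as.data.frame(counts, responseName = "Frequency")
trans_df$Probability <- round(as.vector(probs), 4)
trans_df <- trans_df[trans_df$Frequency > 0, ]
trans_df <- trans_df[order(trans_df$Current_State, trans_df$Next_State), ]
trans_df
```

In this toy sequence, state 1 is followed by itself three times out of five, so its self-transition probability is 0.6.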
reconcileTransitionProbability(
df,
hmap_type = "All",
cellid_column,
time_column)

df - A dataframe containing the scoring output together with the dataset used for the scoreHVT function.
hmap_type - If set to ‘without_self_state’, the output is the manual and Markovchain reconciliation plots for the highest transition probability excluding the self-state. If set to ‘with_self_state’, the output is the manual and Markovchain reconciliation plots for the highest transition probability including the self-state. If set to ‘All’, plots both including and excluding the self-state are produced.
cellid_column - A character string specifying the name of the Cell ID column in the dataframe passed to this function.
time_column - A character string specifying the name of the timestamp column in the dataframe passed to this function.
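The difference between the ‘with_self_state’ and ‘without_self_state’ views can be illustrated with a small transition matrix. The matrix `P` below is hypothetical, and the approach (masking the diagonal before taking the row-wise maximum) is an illustration of the idea rather than the package's internal code.

```r
# Hypothetical transition-probability matrix for three cells (rows sum to 1).
P <- matrix(c(0.90, 0.07, 0.03,
              0.02, 0.95, 0.03,
              0.10, 0.05, 0.85),
            nrow = 3, byrow = TRUE,
            dimnames = list(from = c("1", "2", "3"), to = c("1", "2", "3")))

# Most likely next state when self-transitions are allowed.
with_self <- unname(colnames(P)[apply(P, 1, which.max)])

# Mask the diagonal so that self-transitions are ignored, then repeat.
P_no_self <- P
diag(P_no_self) <- NA
without_self <- unname(colnames(P)[apply(P_no_self, 1, which.max)])

data.frame(Current_State = rownames(P), with_self, without_self)
```

With self-transitions allowed, every cell's most likely next state is itself, which is why the self-state view is dominated by the diagonal. Masking the diagonal reveals where the system tends to move once it leaves a cell.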
reconcile_plots <- reconcileTransitionProbability(df = scored_data, hmap_type = "All", cellid_column = "Cell.ID", time_column = "t")

Manual reconciliation of transition probability with self-state
The darker diagonal cells indicate higher probabilities of staying in the same state. These transitions represent situations where there is no change from the current state to the next state. Such states might be attractors in a dynamic system, where the system naturally tends to return to these states even after minor perturbations.
reconcile_plots[[1]]

Manual reconciliation of transition probability without self-state
In this plot, the transitions suggest that the states tend to move to neighboring states more frequently. Proximity might not only refer to physical distance but also to similarities in attributes or conditions.
reconcile_plots[[2]]

Markovchain reconciliation of transition probability with self-state
This plot uses the same data as the manual reconciliation, but estimates the transition probabilities, including self-state transitions, with the markovchainFit function.
reconcile_plots[[3]]

Markovchain reconciliation of transition probability without self-state
This plot uses the same data as the manual reconciliation, but estimates the transition probabilities, excluding self-state transitions, with the markovchainFit function.
reconcile_plots[[4]]

plotAnimatedFlowmap(
hvt_model_output,
transition_probability_df,
df,
animation = "All",
flow_map = "All",
animation_speed = 2,
threshold = 0.6,
cellid_column,
time_column )

hvt_model_output - The list object output by the trainHVT function.
transition_probability_df - A list of dataframes output by the getTransitionProbability function.
df - A dataframe containing the scoring output together with the dataset used for the scoreHVT function.
animation - Character. If set to ‘time_based’, the output is a dot animation of state transitions ordered by timestamp. If set to ‘state_based’, the output is an arrow animation based on the highest-probability next state excluding the self-state. If set to ‘All’, both animations are produced.
flow_map - Character. If set to ‘self_state’, the output is a dot flow map for the next state based on the highest transition probability. If set to ‘probability’, the output is an arrow flow map in which the arrow size reflects the probability and each arrow points to the next state with the highest transition probability excluding the self-state. If set to ‘All’, both flow maps are produced.
fps_for_time - A numeric value indicating the frames per second for the time-based animated flow map. Must be a factor of 100. Default value is 1.
fps_for_state - A numeric value indicating the frames per second for the state-based animated flow map. Must be a factor of 100. Default value is 1.
time_duration - A numeric value indicating the total duration (in seconds) of the gif for the time-based animated flow map. Default value is 2.
state_duration - A numeric value indicating the total duration (in seconds) of the gif for the state-based animated flow map. Default value is 2.
threshold - A numeric value between 0.1 and 1 that controls the categorization of probability values into “High Probability” and “Low Probability” for the flow map type “Probability”.
cellid_column - A character string specifying the name of the Cell ID column in the dataframe passed to this function.
time_column - A character string specifying the name of the timestamp column in the dataframe passed to this function.
plots <- plotAnimatedFlowmap(hvt_model_output = hvt.results, transition_probability_df = trans_table, df = scored_data, animation = "All", flow_map = "All", fps_time = 10, fps_state = 2, time_duration = 200, state_duration = 50, threshold = 0.7, cellid_column = "Cell.ID", time_column = "t", duration = 50)

Flow map: Highest transition probability considering self-state
The circle around each centroid represents the self-state probability.
Flow Map: Highest transition probability excluding self-states - Arrow size represents transition probability
The length of each arrow segment is based on the transition probability.
Flow map animation: Highest state transition probabilities (Including self-states)